Abstract: Statistical modeling of physical laws connects
experiments with mathematical descriptions of natural phenomena.
The modeling is based on the probability density of measured 
variables expressed by experimental data via a kernel estima- 
tor. As an objective kernel the scattering function determined 
by calibration of the instrument is introduced. This function 
provides for a new definition of experimental information and 
redundancy of experimentation in terms of information entropy. 
The redundancy increases with the number of experiments, while 
the experimental information converges to a value that describes 
the complexity of the data. The difference between the redun- 
dancy and the experimental information is proposed as the model 
cost function. From its minimum, a proper number of data in the 
model is estimated. As an optimal, nonparametric estimator of 
the relation between measured variables the conditional average 
extracted from the kernel estimator is proposed. The modeling 
is demonstrated on noisy chaotic data. 

Index Terms: kernel estimator, experimental information,
complexity, redundancy, modeling of a physical law, model cost
function, conditional average predictor, nonparametric regression,
predictor quality, noisy chaotic generator



I. Introduction 

EXPERIMENTAL exploration of natural phenomena in- 
cludes measurements and descriptions of corresponding 
physical laws [1]. Modern experimental systems can perform 
measurements automatically, and therefore the question arises 
of how to develop a system for an automatic description of 
physical laws [2]. Tools involved in both tasks of exploration 
differ essentially in their character: measurements are based 
on devices and provide data about measured variables, while 
descriptions are based on mathematical methods and yield 
relations between these variables [1]. Nature has a tremendous 
variety of properties, which results in the diversity of functions 
applicable to mathematical modeling of corresponding physi- 
cal laws. On the contrary, a unique function is needed for an 
automatic modeling of physical laws in experimental systems. 
To bridge this gap, we employ the probability distribution [3] 
as a common basis for the description of natural properties and 
propose a nonparametric regression as a general method for the 
experimental modeling of physical laws. The corresponding 
statistical estimator is the conditional average (CA), which 
can be automatically extracted from the probability density 
function (PDF) in a measurement system [2]. 

For a nonparametric expression of the PDF, Parzen has 
proposed a kernel estimator [4], [5], and his method has been 
successfully applied to the statistical description of various
natural laws governing complex phenomena in a variety of
fields [2], [6], [7]. However, a common weakness of these 
applications is the lack of an objective kernel function and 
a heuristic selection of the number of data representing 
the model. The same weakness is characteristic of several 
heuristic methods developed from Parzen's estimator in the 
fields of neural networks and artificial intelligence [8], [9]. 
Here we avoid this deficiency by specifying the kernel more 
objectively based upon calibration of the measurement system 
[4], [5]. This specification requires a statistical description of 
instrument output scattering during calibration, which further 
provides for definitions of the indeterminacy of measurements, 
experimental information, complexity of data, redundancy of 
measurements, information cost function and estimation of a 
proper number of data for modeling [3], [10], [11], [12], [13], 
[14], [15]. 

II. Estimation of probability distribution 

In order to introduce an objective kernel function we con- 
sider a phenomenon that can be explored experimentally by 
a setup containing only two sensors, since the generalization 
to more complex cases with several sensors is straightforward. 
The signals from the sensors are represented by the couple
$z = (x, y)$. We assume that the phenomenon can be characterized
statistically by repetition of measurements yielding sample
points in the span of the instrument $S_z$. This span is a Cartesian
product $S_x \times S_y$ of spans corresponding to both channels. We
assume that both spans are equal and given by the interval
$(-L, L)$.

Measurements are generally subject to stochastic distur- 
bances or noise, which makes their outcomes uncertain [10]. 
The uncertainty is usually represented just by the standard 
deviation of variables during calibration [2], [10]. However, 
this is not sufficient to answer the following basic questions: 

1) How much information can be provided by measure- 
ments that are influenced by noise [10], [16]? 

2) How many experiments are needed for modeling a phys- 
ical law corresponding to the phenomenon [2], [11]? 

3) How complex should the model of this law be [12], [13], 
[11], [16], [17]? 

In the following we try to answer these questions based 
on information theory. With this aim we first describe the 
signal scattering during calibration of the instrument and then 
proceed to the uncertainty of experimental observation. 

For a simultaneous calibration of both instrument channels
we have to perform a measurement on an object representing
two physical units $u_x$ and $u_y$, which we together denote by
the joint unit $u = (u_x, u_y)$. The scattering of instrument
outputs during calibration is characterized by the joint PDF
$\psi(z|u)$, which we call the scattering function (SF) [11], [2],
[10]. When the interaction between both channels is negligible,
the SF is given by the product $\psi(z|u) = \psi(x|u_x)\,\psi(y|u_y)$.
Without loss of generality we further consider a case with
equal sensors which are subject to mutually independent
random disturbances that do not depend on $u$. In such cases
probability theory suggests expressing the SF as
$\psi(z - u) = g(x - u_x, \sigma)\,g(y - u_y, \sigma)$, where the Gaussian function

$$g(x - u_x, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - u_x)^2}{2\sigma^2}\right] \qquad (1)$$

describes the scattering of signal $x$. The parameters $u_x$, $\sigma$
represent the mean value and standard deviation of this signal
at the calibration and can be statistically estimated.

When we perform a single measurement we get a sample
$z_1 = (x_1, y_1)$ that represents the mean value of $z$ during
measurement and, therefore, we express the PDF as
$\psi(z - z_1) = \psi(x - x_1)\,\psi(y - y_1)$. When we repeat the measurements
$N$ times we get samples $z_i$, $1 \le i \le N$, with which we model
the joint PDF by the statistical average:

$$f(z) = \frac{1}{N} \sum_{i=1}^{N} \psi(z - z_i). \qquad (2)$$
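A minimal numerical sketch of Eq. (2) is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names `gaussian` and `joint_pdf`, the three sample points, and the integration grid are hypothetical choices made only for this example. The last lines check by a Riemann sum that the estimate integrates to approximately one.

```python
import numpy as np

def gaussian(d, sigma):
    """Gaussian scattering function g(d, sigma) of Eq. (1), with d = x - u_x."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def joint_pdf(x, y, xs, ys, sigma):
    """Kernel estimate f(z) of Eq. (2) at points z = (x, y).

    x, y   : arrays (or scalars) of evaluation points with the same shape
    xs, ys : 1-D arrays holding the N samples z_i = (x_i, y_i)
    """
    gx = gaussian(np.asarray(x, dtype=float)[..., None] - xs, sigma)
    gy = gaussian(np.asarray(y, dtype=float)[..., None] - ys, sigma)
    return np.mean(gx * gy, axis=-1)

# Hypothetical samples and scattering width, chosen only for illustration.
xs = np.array([-0.5, 0.0, 0.4])
ys = np.array([0.2, -0.1, 0.3])
sigma = 0.2

# Riemann-sum check of the normalization on a grid much wider than the data.
grid = np.linspace(-3.0, 3.0, 301)
step = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid, indexing="ij")
total = joint_pdf(X, Y, xs, ys, sigma).sum() * step**2
print(f"integral of f(z) over the grid: {total:.4f}")
```

The normalization holds for any sample set because each term of the sum is itself a normalized PDF.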



Properties of particular variables $x$, $y$ are described by the
marginal PDFs $f(x)$, $f(y)$. They are obtained from the joint
PDF by integration with respect to one component, as for
example:

$$f(x) = \int_{S_y} f(z)\,dy = \frac{1}{N} \sum_{i=1}^{N} \psi(x - x_i). \qquad (3)$$



For the modeling of natural laws the most important is the
conditional PDF of the variable $y$ at a given value of $x$, defined
as:

$$f(y|x) = \frac{f(z)}{f(x)} = \frac{\sum_{i=1}^{N} \psi(z - z_i)}{\sum_{i=1}^{N} \psi(x - x_i)}. \qquad (4)$$
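Eq. (4) can be evaluated directly from the samples. The following sketch is only an illustration: the function name `conditional_pdf`, the sample values, and the grid over $y$ are assumptions of this example. It checks that $f(y|x)$ integrates to approximately one over $y$ at a fixed $x$.

```python
import numpy as np

def gaussian(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def conditional_pdf(y, x, xs, ys, sigma):
    """f(y|x) of Eq. (4) on an array of y values at a single value x.

    x is assumed to lie within a few sigma of at least one sample x_i;
    otherwise the denominator underflows.
    """
    wx = gaussian(x - xs, sigma)                                     # psi(x - x_i), shape (N,)
    gy = gaussian(np.asarray(y, dtype=float)[:, None] - ys, sigma)   # psi(y - y_i), shape (M, N)
    return gy @ wx / wx.sum()

# Hypothetical samples, chosen only for illustration.
xs = np.array([-0.5, 0.0, 0.4])
ys = np.array([0.2, -0.1, 0.3])
y_grid = np.linspace(-3.0, 3.0, 601)
f_cond = conditional_pdf(y_grid, 0.1, xs, ys, sigma=0.2)
area = f_cond.sum() * (y_grid[1] - y_grid[0])
print(f"integral of f(y|x) over y: {area:.4f}")
```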



III. Information statistics 



It is important that the estimators (2), (3), (4) are expressed by data
and an SF that can be completely determined by repetition
of the experiment. However, the basic question is: how to
select a proper number of data utilized in these estimators?
To answer this question we next describe the indeterminacy
of the variable $z$ by the entropy of information [11]. For this
purpose we first introduce a reference PDF that is constant
within the span $S_z$: $p(z) = p(x)\,p(y) = 1/(2L)^2$, and vanishes
elsewhere, and define the indeterminacy of $z$ by the negative
relative information entropy [9], [10], [14]:

$$H_z = -\iint_{S_z} f(z) \log f(z)\,dx\,dy - 2\log(2L). \qquad (5)$$



By using the scattering function as the PDF, we get the
uncertainty of the instrument calibration

$$H_u = -\iint_{S_z} \psi(z - u) \log \psi(z - u)\,dx\,dy - 2\log(2L). \qquad (6)$$
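For the Gaussian SF of Eq. (1) the integral in Eq. (6) can be evaluated in closed form. The short derivation below is added as a worked step and assumes $\sigma \ll L$, so that the Gaussian tails outside the span are negligible; it shows where a term of the form $2\log(\sigma/L)$ comes from.

```latex
-\iint \psi(z-u)\,\log\psi(z-u)\,dx\,dy
  = 2\cdot\tfrac{1}{2}\log\!\left(2\pi e\,\sigma^{2}\right)
  = 1 + \log(2\pi) + 2\log\sigma ,
\qquad
H_u \approx 1 + \log(2\pi) + 2\log\sigma - 2\log(2L)
    = 2\log\frac{\sigma}{L} + 1 + \log\frac{\pi}{2}.
```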



The term $2\log(\sigma/L)$ represents the lowest attainable
uncertainty of measurement. The indeterminacy $H_z$ is generally
greater than $H_u$ and we define the experimental information
$I(N)$ by the difference

$$I(N) = H_z - H_u = -\iint f(z) \log f(z)\,dx\,dy + \iint \psi(z - u) \log \psi(z - u)\,dx\,dy. \qquad (7)$$

The quantity $I(N)$ represents the information provided by $N$
experiments on an instrument that is subject to noise [11], [14].
When sample points $z_1, \ldots, z_N$ are separated by several $\sigma$, the
distributions $\psi(z - z_i)$ are not overlapping and Eq. (7) yields
$I(N) \approx \log N$. When distributions $\psi(z - z_i)$ are overlapping
we get $0 \le I(N) \le \log N$.

In an exploration the gains of measurement channels are
normally set so that sample points $z_i$ are as evenly distributed
as possible over the instrument span $S_z$. In such a case the
sample points are rather far apart when $N$ is small and yield
an approximately maximal possible value $\log N$ of $I(N)$.
However, with increasing $N$, the experimental information
$I(N)$ increases more slowly than $\log N$ due to increasing
overlapping of distributions $\psi(z - z_i)$ and therefore, measurements
become ever more redundant. The difference

$$R(N) = \log N - I(N) \qquad (8)$$

thus represents the redundancy of repeated measurements in
$N$ experiments. Since the overlapping of distributions
$\psi(z - z_i)$ increases with $N$, the experimental information converges
to a limit $I(\infty)$, and along with this, the redundancy increases
logarithmically with $N$ [11].
The quantity

$$K(N) = e^{I(N)} \qquad (9)$$

determines the number of non-overlapping distributions that
represent the experimental observation. With increasing $N$, the
quantity $K(N)$ converges to a limit $K(\infty)$ that represents the
complexity of the data [11], [17]. It is convenient from the
experimental point of view that $K(\infty)$ can be well estimated
from a finite number of experiments. We could conjecture
that the proper number of experiments can be specified by
$K(\infty)$. However, this number can be even more conveniently
estimated from the minimum of the cost function defined as
the difference $C(N) = R(N) - I(N)$ of the redundancy and
the experimental information. The redundancy is
$R(N) = \log N - I(N)$ and hence the cost function is:

$$C(N) = \log N - 2I(N). \qquad (10)$$



Fig. 1. The joint PDF $f(z)$ utilized to demonstrate the properties of statistics
$I$, $R$, $C$ and the conditional average estimator.

Since $I(N)$ is approximately $\log N$ at small $N$, and is
approximately constant for large $N$, the cost function $C(N)$
exhibits a minimum at a certain number $N_o$. We consider $N_o$
as the proper number of experiments to be performed for the
exploration and modeling of the phenomenon. Because the
redundancy is equally accounted for in the cost function as
the experimental information, it turns out that the model in
Eq. (2) with $N_o$ data is a coarse estimator of the PDF [11].
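The statistics of Eqs. (7) to (10) can be estimated numerically. The sketch below is a minimal illustration, assuming the Gaussian SF of Eq. (1) and replacing the integrals by a Riemann sum on a square grid; the function names, the grid parameters, and the two test configurations are assumptions of this example and not part of the original method. It uses the closed form $1 + \log(2\pi\sigma^2)$ for the entropy of the two-dimensional Gaussian SF.

```python
import numpy as np

def gaussian(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def experimental_information(xs, ys, sigma, half_width=3.0, n_grid=241):
    """I(N) of Eq. (7), with the integral replaced by a Riemann sum."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    grid = np.linspace(-half_width, half_width, n_grid)
    step = grid[1] - grid[0]
    gx = gaussian(grid[:, None] - xs[None, :], sigma)    # shape (n_grid, N)
    gy = gaussian(grid[:, None] - ys[None, :], sigma)    # shape (n_grid, N)
    f = gx @ gy.T / len(xs)                              # f(z) of Eq. (2) on the grid
    pos = f > 0.0                                        # skip cells that underflowed to zero
    h_f = -np.sum(f[pos] * np.log(f[pos])) * step**2     # entropy of the estimated PDF
    h_psi = 1.0 + np.log(2.0 * np.pi * sigma**2)         # entropy of the 2-D Gaussian SF
    return h_f - h_psi

def information_statistics(xs, ys, sigma):
    """Return I(N), R(N), C(N) and K(N) of Eqs. (7) to (10)."""
    n = len(xs)
    info = experimental_information(xs, ys, sigma)
    redundancy = np.log(n) - info
    cost = np.log(n) - 2.0 * info
    return info, redundancy, cost, np.exp(info)

# Two well-separated samples: I is close to log 2 and R is close to 0.
far = information_statistics([-1.0, 1.0], [-1.0, 1.0], sigma=0.2)
# Two coincident samples: I is close to 0 and R is close to log 2.
same = information_statistics([0.0, 0.0], [0.0, 0.0], sigma=0.2)
print(far)
print(same)
```

The two test configurations reproduce the limiting cases discussed in the text: non-overlapping distributions give the maximal information $\log N$, while fully overlapping ones give pure redundancy.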

A. Properties of Information Statistics 

To demonstrate the properties of statistics stemming from
the information entropy we utilize the data produced by a
noisy chaotic generator. This example is considered since
it is often met in the exploration of complex and chaotic
natural phenomena [2], [18], and since it makes comparison
feasible with the expected properties of the modeling. The
data were generated as $x_i = x_{o,i} + n_{x,i}$ and
$y_i = y_{o,i} + n_{y,i}$, where $x_{o,i}$ and $y_{o,i}$ are
two successive chaotic values that are related by a logistic map
[2], [18], while the terms $n_{x,i}$, $n_{y,i}$ represent measurement noise
calculated by independent random generators with zero mean
and a standard deviation $\sigma = 0.2$. This noise corresponds to
the Gaussian SF $\psi(z) = g(x, 0.2)\,g(y, 0.2)$.
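A data generator of this kind can be sketched as follows. The exact form and parameter of the logistic map are not given in the text, so the map $x \to 1 - 2x^2$ on $[-1, 1]$ used here is an assumption, as are the function names, the seed, and the burn-in length.

```python
import numpy as np

def logistic_step(x):
    """One step of a logistic map in the form x -> 1 - 2 x^2 on [-1, 1] (assumed form)."""
    return 1.0 - 2.0 * x**2

def noisy_chaotic_pairs(n, sigma=0.2, seed=0, burn_in=100):
    """Return noisy samples (xs, ys) and the noise-free values (x_o, y_o)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0)
    for _ in range(burn_in):            # discard the initial transient
        x = logistic_step(x)
    x_o = np.empty(n)
    for i in range(n):
        x_o[i] = x
        x = logistic_step(x)
    y_o = logistic_step(x_o)            # successive value of the chaotic sequence
    xs = x_o + rng.normal(0.0, sigma, n)   # independent measurement noise on x
    ys = y_o + rng.normal(0.0, sigma, n)   # independent measurement noise on y
    return xs, ys, x_o, y_o

xs, ys, x_o, y_o = noisy_chaotic_pairs(200, sigma=0.2, seed=0)
print(xs[:3], ys[:3])
```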

For the demonstration we first formed the basic data set
$\{x_i, y_i\}$ with $N = 200$ samples. These data were used to
estimate the joint PDF by Eq. (2). The graph of the estimated PDF
is shown in Fig. 1, while graphs of corresponding experimental
information $I$, redundancy $R$, and cost function $C$ are shown
in Fig. 2. In the same figure the maximal possible information
is presented by the curve $\log N$.

The experimental information $I(N)$ converges with increasing
$N$ to $I(\infty) \approx 3.8$, which yields $K(\infty) \approx 45$. Due to the
convergence of experimental information the curve $I(N)$ starts
to deviate from $\log N$ with increasing $N$. Consequently, the
redundancy $R = \log N - I(N)$ starts to increase, and this
leads to a minimum of the cost function $C(N) = \log N - 2I(N)$.
The minimum, which occurs at $N_o \approx 32$, is not very
pronounced due to statistical variations. $N_o$ is smaller than but
close to $K(\infty)$.
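The search for the minimum of the cost function can be sketched as a scan over $N$. This is an illustration only: the logistic map in the form $x \to 1 - 2x^2$, the seed, the grid used for the Riemann sum, and all function names are assumptions, so the resulting numbers are not expected to reproduce the values quoted above.

```python
import numpy as np

def gaussian(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def experimental_information(xs, ys, sigma, half_width=3.0, n_grid=241):
    """I(N) of Eq. (7) by a Riemann sum (a numerical assumption of this sketch)."""
    grid = np.linspace(-half_width, half_width, n_grid)
    step = grid[1] - grid[0]
    gx = gaussian(grid[:, None] - xs[None, :], sigma)
    gy = gaussian(grid[:, None] - ys[None, :], sigma)
    f = gx @ gy.T / len(xs)                            # f(z) of Eq. (2) on the grid
    pos = f > 0.0
    h_f = -np.sum(f[pos] * np.log(f[pos])) * step**2   # entropy of the estimated PDF
    return h_f - (1.0 + np.log(2.0 * np.pi * sigma**2))

def noisy_logistic_pairs(n, sigma, seed):
    """Hypothetical generator: successive values of x -> 1 - 2 x^2 plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x_o = np.empty(n + 1)
    x_o[0] = rng.uniform(-1.0, 1.0)
    for i in range(n):
        x_o[i + 1] = 1.0 - 2.0 * x_o[i]**2
    xs = x_o[:-1] + rng.normal(0.0, sigma, n)
    ys = x_o[1:] + rng.normal(0.0, sigma, n)
    return xs, ys

sigma, n_max = 0.2, 200
xs, ys = noisy_logistic_pairs(n_max, sigma, seed=1)
n_values = np.arange(1, n_max + 1)
info = np.array([experimental_information(xs[:n], ys[:n], sigma) for n in n_values])
redundancy = np.log(n_values) - info          # Eq. (8)
cost = np.log(n_values) - 2.0 * info          # Eq. (10)
n_opt = n_values[np.argmin(cost)]             # location of the cost minimum
print(f"N_o = {n_opt}, I(N_max) = {info[-1]:.2f}, K(N_max) = {np.exp(info[-1]):.1f}")
```

Because the minimum is shallow, the location found by `argmin` fluctuates from one sample set to another, which is consistent with the remark that the minimum is not very pronounced.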

To demonstrate the influence of scattering width and statistical
variation on the presented statistics the calculations were
repeated for $\sigma = 0.1$ and $0.4$ with three different sample sets.
The results are shown in Fig. 3. As could be expected, the
limit value of $I$ increases with decreasing $\sigma$. This property is
consistent with the well-known fact that more information can
be obtained by using an instrument of higher accuracy, which
corresponds to a smaller scattering width. In contrast to this, the
redundancy of measurement decreases, and along with it, the
optimal number $N_o$ increases with the decreasing scattering
width.

Fig. 2. Dependence of $\log N$, experimental information $I$, redundancy $R$,
and cost function $C$ on the number of samples $N$. Statistics are expressed in
the natural unit of information, nat.

Fig. 3. Dependence of $\log N$, experimental information $I$, redundancy $R$,
and cost function $C$ on the number of samples $N$, determined from various
data sets at $\sigma = 0.1$ and $0.4$.

IV. Estimation of a Physical Law 

The example shown in Fig. 1 resembles a ridge along a line
$y_o(x)$ which we want to extract from the given data in an
optimal way. For this purpose we select from a set of joint
data only those that all have a certain value $x$. These joint
data generally exhibit various values of $y$. We consider as an
optimal predictor of the variable $y$ from a given value $x$ the
value $y_p$ at which the mean square prediction error is minimal:

$$E[(y_p - y)^2 | x] = \min(y_p). \qquad (11)$$

The minimum occurs where $dE[(y_p - y)^2 | x]/dy_p = 0$, which
yields as the optimal predictor $y_p$ the conditional average:

$$y_p(x) = E[y|x] = \int y\, f(y|x)\,dy. \qquad (12)$$
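The step from Eq. (11) to Eq. (12) can be written out explicitly. The derivation below is the standard one and uses only the normalization $\int f(y|x)\,dy = 1$; the second derivative with respect to $y_p$ equals 2, so the stationary point is a minimum.

```latex
\frac{\partial}{\partial y_p}\int (y_p - y)^2 f(y|x)\,dy
  = 2\int (y_p - y)\, f(y|x)\,dy
  = 2\left[\, y_p - \int y\, f(y|x)\,dy \right] = 0
\;\Longrightarrow\;
y_p(x) = \int y\, f(y|x)\,dy = E[y|x].
```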



IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. ?, NO. ??, MONTH? 2006 






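By using Eq. (4) we express the conditional average as:

$$y_p(x) = \frac{\sum_{i=1}^{N} y_i\, g(x - x_i, \sigma)}{\sum_{j=1}^{N} g(x - x_j, \sigma)} = \sum_{i=1}^{N} y_i\, C_i(x). \qquad (13)$$

The coefficients

$$C_i(x) = \frac{g(x - x_i, \sigma)}{\sum_{j=1}^{N} g(x - x_j, \sigma)} \qquad (14)$$

satisfy the conditions

$$\sum_{i=1}^{N} C_i(x) = 1, \qquad 0 \le C_i(x) \le 1. \qquad (15)$$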
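Eqs. (13) and (14) translate into a few lines of code. The sketch below is a minimal illustration with hypothetical function names and sample values. Since the normalization constant of the Gaussian cancels in Eq. (14), only the exponents are needed; subtracting the largest exponent in each row before exponentiation is a numerical safeguard of this sketch and leaves the coefficients unchanged.

```python
import numpy as np

def ca_coefficients(x, xs, sigma):
    """Coefficients C_i(x) of Eq. (14); returns an array of shape (M, N)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xs = np.asarray(xs, dtype=float)
    logw = -(x[:, None] - xs[None, :])**2 / (2.0 * sigma**2)
    # Shift by the row maximum so that at least one weight equals 1 (avoids 0/0).
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def ca_predict(x, xs, ys, sigma):
    """Conditional average predictor y_p(x) of Eq. (13)."""
    return ca_coefficients(x, xs, sigma) @ np.asarray(ys, dtype=float)

# Hypothetical samples, chosen only for illustration.
xs = np.array([-1.0, 0.0, 1.0])
ys = np.array([2.0, 0.0, -1.0])
queries = np.array([0.0, 0.5])
coeff = ca_coefficients(queries, xs, sigma=0.2)
pred = ca_predict(queries, xs, ys, sigma=0.2)
print(coeff.sum(axis=1))   # each row sums to one, as required by Eq. (15)
print(pred)                # close to y_2 at x = 0, midway between y_2 and y_3 at x = 0.5
```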



The coefficient $C_i(x)$ can be interpreted as a normalized
measure of similarity between the given $x$ and the sample $x_i$.
The calculation of $y_p(x)$ corresponds to an associative recall
of memorized items, which is a property of intelligence.
Therefore, the estimator $y_p(x)$ could be treated as a basis for
the development of machine intelligence based on modeling
of natural laws [2], [8].

A predictor maps the stochastic variable $x$ to a new stochastic
variable $y_p$ that generally differs from the variable $y$. When
the variables $x$, $y$ are related by some physical law and the
measurement noise is small, we expect that the first and second
statistical moments $E[y - y_p]$, $E[(y - y_p)^2]$ of the prediction
error are also small. The second moment is:
$E[(y - y_p)^2] = \mathrm{Var}(y) + \mathrm{Var}(y_p) - 2\,\mathrm{Cov}(y, y_p) + [m(y) - m(y_p)]^2$, where
$E$, $m$, Var, Cov denote statistical average, mean value, variance
and covariance, respectively. In the case of statistically
independent variables $y$ and $y_p$ with equal mean values we
get: $E[(y - y_p)^2] = \mathrm{Var}(y) + \mathrm{Var}(y_p)$. With respect to this
property we define the predictor quality by the formula

$$Q = 1 - \frac{E[(y - y_p)^2]}{\mathrm{Var}(y) + \mathrm{Var}(y_p)} = \frac{2\,\mathrm{Cov}(y, y_p)}{\mathrm{Var}(y) + \mathrm{Var}(y_p)} - \frac{[m(y) - m(y_p)]^2}{\mathrm{Var}(y) + \mathrm{Var}(y_p)}. \qquad (16)$$

The quality is 1 if the prediction is exact: $y_p = y$, while it is
0 if $y$ and $y_p$ are statistically independent and have equal mean
values. The quality $Q$ may be negative if $m(y) \ne m(y_p)$.
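A sample estimate of the quality in Eq. (16) can be sketched as follows. The function name and the small test vectors are hypothetical; population (not bias-corrected) variances are used so that the decomposition of the second moment given above holds exactly for the sample.

```python
import numpy as np

def predictor_quality(y, y_p):
    """Sample estimate of the predictor quality Q of Eq. (16)."""
    y = np.asarray(y, dtype=float)
    y_p = np.asarray(y_p, dtype=float)
    mse = np.mean((y - y_p)**2)                     # second moment of the prediction error
    return 1.0 - mse / (np.var(y) + np.var(y_p))    # np.var uses ddof=0 by default

y = np.array([0.0, 1.0, 2.0, 3.0])                  # hypothetical test values
q_exact = predictor_quality(y, y)                   # exact prediction
q_bias = predictor_quality(y, y + 1.0)              # constant bias lowers the quality
q_anti = predictor_quality(y, -y)                   # anti-correlated prediction gives Q < 0
print(q_exact, q_bias, q_anti)
```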

For the predictor defined by the conditional average
$y_p(x) = \int y\, f(y|x)\,dy$, we analytically obtain the equalities:
$m(y) = m(y_p)$ and $\mathrm{Cov}(y, y_p) = \mathrm{Var}(y_p)$, which yield

$$Q = \frac{2\,\mathrm{Var}(y_p)}{\mathrm{Var}(y) + \mathrm{Var}(y_p)}. \qquad (17)$$

From the definition of the conditional average it follows that
$0 \le \mathrm{Var}(y_p) \le \mathrm{Var}(y)$ and therefore $0 \le Q \le 1$. This
inequality need not be fulfilled exactly if the CA is statistically
estimated from a finite number of samples. With increasing
$N$ we generally expect that the CA statistically estimated by
Eq. (13) increasingly better represents the underlying physical
law. In relation to this expectation there arises the question of
whether the number $N_o$ of data yields a judicious estimation
of the underlying law.




Fig. 4. Testing of the CA predictor. Graphs represent the underlying law $y_o$
and given data $y$ (top two), test data $y_t$ and predicted data $y_p$ (middle two),
and prediction error $y_p - y_t$ (bottom). Graphs are displaced in the vertical
direction for better visualization.



A. Properties of CA Predictor 

To answer the last question we demonstrate the properties
of the CA predictor for the case of noise-corrupted chaotic
data with standard deviation of scattering $\sigma = 0.2$. From the
set of 200 data that were used when estimating the joint PDF
by Eq. (2), a reduced set $\{x_i, y_i;\ i = 1, \ldots, N = 50\}$ was
utilized for the sake of clear presentation. These basic data
are shown by stars in the top curve of Fig. 4 together with the
underlying law $y_o(x)$.

The conditional average predictor was modeled by inserting
data from the basic data set into Eq. (13). To demonstrate its
performance, we additionally generated a test data set $\{x_{i,t}, y_{i,t}\}$
with different seeds of random generators. Using the values
$x_{i,t}$ of the test set, we then calculated the corresponding values
of $y_p$ by the modeled CA predictor. The test and predicted data
are shown by the middle two curves in Fig. 4. The prediction
error $y_p - y_t$, calculated from both data sets, is presented by the
bottom curve in Fig. 4. The curve joining the predicted data
is smoother than the curve joining the original test data. The
smoothing is a consequence of estimating the CA from various
data $y_i$ from the basic data set. In spite of this difference
between both curves, we can intuitively conclude that rough
properties of the hidden law $y_o(x)$ are properly revealed by
the predictor.
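The modeling and testing procedure described above can be sketched end to end. Everything specific in this example is an assumption: the logistic map in the form $x \to 1 - 2x^2$, the seeds, and the function names. The resulting quality therefore depends on these choices and is not expected to reproduce the figures.

```python
import numpy as np

def noisy_logistic_pairs(n, sigma, seed):
    """Hypothetical generator: successive values of x -> 1 - 2 x^2 plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x_o = np.empty(n + 1)
    x_o[0] = rng.uniform(-1.0, 1.0)
    for i in range(n):
        x_o[i + 1] = 1.0 - 2.0 * x_o[i]**2
    xs = x_o[:-1] + rng.normal(0.0, sigma, n)
    ys = x_o[1:] + rng.normal(0.0, sigma, n)
    return xs, ys

def ca_predict(x, xs, ys, sigma):
    """Conditional average predictor of Eq. (13) with stabilized Gaussian weights."""
    logw = -(x[:, None] - xs[None, :])**2 / (2.0 * sigma**2)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    return (w @ ys) / w.sum(axis=1)

def predictor_quality(y, y_p):
    """Sample estimate of Q of Eq. (16)."""
    return 1.0 - np.mean((y - y_p)**2) / (np.var(y) + np.var(y_p))

sigma = 0.2
x_train, y_train = noisy_logistic_pairs(50, sigma, seed=1)   # basic data set, N = 50
x_test, y_test = noisy_logistic_pairs(50, sigma, seed=2)     # test set, different seed
y_pred = ca_predict(x_test, x_train, y_train, sigma)         # predictions on the test inputs
error = y_pred - y_test                                      # prediction error
q = predictor_quality(y_test, y_pred)
print(f"Q = {q:.3f}, rms error = {np.sqrt(np.mean(error**2)):.3f}")
```

Each prediction is a convex combination of the training values $y_i$, which is the source of the smoothing noted in the text.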

The quality of the CA predictor depends on sets of samples
utilized in statistical modeling and testing. To demonstrate
this dependence, we repeated the modeling and testing three
times, using various statistical sample sets with increasing $N$.
The estimated predictor quality $Q$ is presented in Fig. 5 as a
function of the number of samples. For each data set the
statistical fluctuations decrease with increasing $N$, so that qualities
converge to the same limit. With increasing $N$, the curves
determined from different data sets merge approximately at
the number $N_{CA} = 15$. At the previously determined optimal
number $N_o = 32$ the quality is above 0.99. The difference
between curves is there about two orders of magnitude smaller
than the corresponding quality and apparently disappears at
$K(\infty) \approx 45$. With respect to these properties we argue that
in the present case about $N_o$ data values already provide for
a judicious modeling of the underlying law $y_o(x)$ by the CA
predictor.

Fig. 5. Dependence of predictor quality $Q$ on the number of samples $N$
determined by various statistical data sets.

The quality of the CA predictor exhibits a convergence
to some limit value that characterizes the applicability of
the modeling. The limit value generally increases with the
decreasing scattering width $\sigma$, but on average $Q$ is less than
1 if $1/\sigma$ and $N$ are finite. This means that it is not possible
to determine exactly the underlying physical law $y = y_o(x)$
based on joint data obtained by an instrument influenced by
stochastic disturbances.



V. Conclusion 

Our approach indicates that the objectively introduced ker- 
nel estimator provides for a nonparametric statistical modeling 
of an explored phenomenon that can be automatically per- 
formed by a computer in a measurement system. Estimated 
statistics provide answers to basic questions about the quantity 
of information provided by experiments, the complexity of the 
data, the proper number of data needed for the modeling, and 
the quality of the predictor of the underlying physical law. 

The estimated physical law represents the distribution of
the variable $y$ at a given value $x$ by a single value $y_p(x)$.
More in tune with our interpretation, the corresponding
conditional PDF is mapped to the scattering function:
$f(y|x) \to \psi(y - y_p(x))$. Such a mapping is generally accompanied by
a reduction of the entropy of information that corresponds
to a certain information gain. This property is opposite to the
loss of information caused by stochastic disturbances in signal 
transmission channels [19]. If the gaining of information from 
observations is considered as a basis of natural intelligence 
[8], [9], then a system that is capable of estimating a physical 
law from measured data autonomously has to be treated as an 
intelligent unit whose level of intelligence can be quantified 
by the information gain. Such an interpretation provides a 
common basis for a unified treatment of experimental sciences 
and natural or artificial intelligence [2], [8].